fix: post-merge follow-ups (font dedup + /process retention)#3
Open
Whateverdoa wants to merge 2 commits into
Open
fix: post-merge follow-ups (font dedup + /process retention)#3Whateverdoa wants to merge 2 commits into
Whateverdoa wants to merge 2 commits into
Conversation
embed_all_fonts, outline_all_fonts and has_unembedded_fonts were each defined twice in the class. Python kept the second (later) copies, which lacked the q/Q graphics-state imbalance fix and the more thorough font-descriptor scan. The module-level aliases therefore bound to the inferior versions. Removed the second set so the q/Q-fixing implementations win. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
/api/pdf/process and /process-with-json-file now persist the original upload and the processed output into settings.storage_dir via FileManager, and prune files past the retention window. Retention is best-effort: failures are logged and never break a successful processing response. - Add _retain_pdf_files helper (best-effort save_original + save_processed + cleanup_old_files; never raises). - Un-xfail test_pdf_retention_flow now that the storage half is implemented. - gitignore pdf_storage/ and test_pdf_storage/ so retained PDFs are never accidentally committed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Retention paths allow traversal
- Added _sanitize_filename() helper in FileManager that strips path components and traversal sequences using os.path.basename() before constructing destination paths, preventing client-controlled filenames from escaping storage directories.
Or push these changes by commenting:
@cursor push 151bbedc17
Preview (151bbedc17)
diff --git a/app/utils/file_manager.py b/app/utils/file_manager.py
--- a/app/utils/file_manager.py
+++ b/app/utils/file_manager.py
@@ -8,6 +8,20 @@
logger = logging.getLogger(__name__)
+
+def _sanitize_filename(filename: str) -> str:
+ """Strip path components and traversal sequences to prevent directory escape.
+
+ Returns a safe base filename. Falls back to 'file' if the result is empty
+ or consists only of dots.
+ """
+ base = os.path.basename(filename.replace("\\", "/"))
+ base = base.lstrip(".")
+ if not base or base == "..":
+ return "file"
+ return base
+
+
class FileManager:
"""
Manages storage and cleanup of PDF files.
@@ -37,7 +51,7 @@
try:
# Add timestamp to filename to avoid collisions and track time
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
- safe_filename = f"{timestamp}_{filename}"
+ safe_filename = f"{timestamp}_{_sanitize_filename(filename)}"
dest_path = self.original_dir / safe_filename
shutil.copy2(file_path, dest_path)
@@ -61,7 +75,7 @@
try:
# Add timestamp if not already present (though processed files usually have unique names)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
- safe_filename = f"{timestamp}_{filename}"
+ safe_filename = f"{timestamp}_{_sanitize_filename(filename)}"
dest_path = self.processed_dir / safe_filename
shutil.copy2(file_path, dest_path)You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit d97ec05. Configure here.
| if input_path and os.path.exists(input_path): | ||
| fm.save_original(input_path, input_filename) | ||
| if output_path and os.path.exists(output_path): | ||
| fm.save_processed(output_path, f"{reference or 'processed'}.pdf") |
There was a problem hiding this comment.
Retention paths allow traversal
High Severity
_retain_pdf_files passes client-controlled job_config reference and upload filename into FileManager without sanitizing path separators. FileManager joins those strings with Path, so values containing .. or / can write retained PDFs outside original/processed or into the sibling retention folder.
Reviewed by Cursor Bugbot for commit d97ec05. Configure here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Two loose ends flagged during the color-analysis review (PR #2), both unrelated to that feature.
1. Remove duplicate font methods in
PDFUtils(commit311fc50)embed_all_fonts,outline_all_fonts, andhas_unembedded_fontswere each defined twice in the class. Python keeps the second copy, which lacked the q/Q graphics-state imbalance fix (and a less thorough font-descriptor scan). The module-level aliases added in PR #2 therefore bound to the inferior versions. Removed the second set so the q/Q-fixing implementations win.Verified:
inspect.getsource(PDFUtils.outline_all_fonts)now containsfix_q_Q_imbalance.2. Wire retention storage into
/process(commitd97ec05)FileManageralready hadsave_original/save_processed/cleanup_old_files, but no endpoint called them — retention was dead code. Now/api/pdf/processand/process-with-json-filepersist the original upload + processed output intosettings.storage_dirand prune past the retention window._retain_pdf_fileshelper: best-effort (save_original+save_processed+cleanup_old_files), logs on failure, never raises — retention cannot break a successful processing response.test_pdf_retention_flownow that the storage half is implemented (the tuple→FileResponse TypeError was already fixed in PR feat: L02 color metadata analysis for /api/pdf/analyze #2)..gitignorepdf_storage/andtest_pdf_storage/so retained PDFs are never committed.Test plan
uv run pytest tests/→ 129 passed (was 128 + 1 xfail; retention test now passes)import mainsucceedsgit statusafter the suite🤖 Generated with Claude Code
Note
Medium Risk
Adds best-effort persistence of uploaded and processed PDFs during
/api/pdf/process*, which changes runtime storage behavior and cleanup timing. Also alters font processing utilities by removing duplicate method definitions, which could affect Ghostscript post-processing results.Overview
Enables runtime PDF retention for processing requests. The
/api/pdf/processand/api/pdf/process-with-json-fileendpoints now best-effort save the original upload and processed output viaFileManagerand prune files past the retention window (logging failures without breaking successful responses).Cleans up font utility duplication. Removes the duplicate
PDFUtilsimplementations ofembed_all_fonts,outline_all_fonts, andhas_unembedded_fontsso the intended versions are used.Also updates
.gitignoreto excludepdf_storage/andtest_pdf_storage/, and unmarks the retention test as expected-fail now that the persistence path is implemented.Reviewed by Cursor Bugbot for commit d97ec05. Bugbot is set up for automated code reviews on this repo. Configure here.